AAAI.2021 - Intelligent Robots

Total: 14

#1 Automatic Generation of Flexible Plans via Diverse Temporal Planning

Authors: Yotam Amitai ; Ayal Taitler ; Erez Karpas

Robots operating in the real world must deal with uncertainty, be it because they work with humans who are unpredictable or simply because they must operate in a dynamic environment. Ignoring the uncertainty is dangerous, while accounting for all possible outcomes is often computationally infeasible. One approach, which lies between ignoring the uncertainty completely and addressing it completely, is to use flexible plans with choice, formulated as Temporal Planning Networks (TPNs). This method has been successfully demonstrated in human-robot teamwork using the Pike executive, an online executive that unifies intent recognition and plan adaptation. However, one of the main challenges to using Pike is the need to manually specify the TPN. In this paper, we address this challenge by describing a technique for automatically synthesizing a TPN which covers multiple possible executions of a given temporal planning problem specified in PDDL 2.1. Our approach starts by using a diverse planner to generate multiple plans, and then merges them into a single TPN. As no diverse planners for temporal planning were available, we first present a novel method for adapting an existing diverse planning method, based on top-k planning, to the temporal setting. We then describe how diverse plans are merged into a single TPN using constraint optimization. Finally, an empirical evaluation on a set of IPC benchmarks shows that our approach scales well and generates TPNs which generalize the set of plans they are generated from.
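
The diversity step of such a pipeline can be illustrated with a small sketch: given the plans returned by a top-k temporal planner, a greedy filter keeps a subset that is maximally spread out under a plan-distance measure. The plan representation (action, start, duration triples) and the Jaccard-style distance below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: turning a top-k plan list into a diverse subset by greedy
# selection on a simple action-set distance. The plan representation and the
# distance measure are illustrative assumptions, not the paper's exact method.

def plan_distance(plan_a, plan_b):
    """Jaccard distance between the sets of action names used by two plans."""
    actions_a = {name for name, _start, _dur in plan_a}
    actions_b = {name for name, _start, _dur in plan_b}
    union = actions_a | actions_b
    return 1.0 - len(actions_a & actions_b) / len(union) if union else 0.0

def select_diverse(top_k_plans, m):
    """Greedily keep m plans that are maximally spread out under plan_distance."""
    selected = [top_k_plans[0]]                     # best-cost plan always kept
    for _ in range(m - 1):
        best = max(
            (p for p in top_k_plans if p not in selected),
            key=lambda p: min(plan_distance(p, q) for q in selected),
        )
        selected.append(best)
    return selected

# Each plan: list of (action, start_time, duration) triples from a temporal planner.
plans = [
    [("move(a,b)", 0.0, 2.0), ("grasp(o1)", 2.0, 1.0)],
    [("move(a,c)", 0.0, 3.0), ("grasp(o1)", 3.0, 1.0)],
    [("move(a,b)", 0.0, 2.0), ("push(o1)", 2.0, 2.0)],
]
print(select_diverse(plans, 2))
```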

#2 BT Expansion: a Sound and Complete Algorithm for Behavior Planning of Intelligent Robots with Behavior Trees

Authors: Zhongxuan Cai ; Minglong Li ; Wanrong Huang ; Wenjing Yang

Behavior Trees (BTs), which generalize existing control architectures and bring unique advantages to building robot systems, have attracted much attention in the robotics field in recent years. Automated synthesis of BTs can reduce human workload and build behavior models for complex tasks beyond the reach of manual design, but theoretical analysis is largely missing from existing methods because it is difficult to conduct formal analysis with the classic BT representations. As a result, these methods may fail in tasks that are actually solvable. This paper proposes BT expansion, an automated planning approach to building intelligent robot behaviors with BTs, and proves its soundness and completeness through the state-space formulation of BTs. The advantages of blended reactive planning and acting are formally discussed through the region of attraction of BTs, by which robots using BT expansion are robust to any resolvable external disturbance. Simulated experiments with a mobile manipulator and test sets validate the effectiveness and efficiency of the approach, where the proposed algorithm surpasses the baseline by virtue of its soundness and completeness. To the best of our knowledge, this is the first work to leverage the state-space formulation to synthesize BTs with a complete theoretical basis.
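
The general idea behind expanding a BT toward a goal can be sketched as condition backchaining: a failing condition node is replaced by a Fallback of that condition and a subtree for an action that achieves it. The node representation and expansion rule below are simplified illustrations, not the paper's BT expansion algorithm or its guarantees.

```python
# Minimal sketch of condition backchaining: a failing condition node is replaced
# by a Fallback of that condition and a Sequence(preconditions..., action) for
# each action whose effects achieve it. Node classes and the planning model are
# simplified illustrations, not the paper's BT expansion algorithm.

class Action:
    def __init__(self, name, preconditions, effects):
        self.name, self.preconditions, self.effects = name, preconditions, effects

def fallback(*children):  return ("Fallback", list(children))
def sequence(*children):  return ("Sequence", list(children))
def condition(literal):   return ("Condition", literal)
def act(a):               return ("Action", a.name)

def expand(tree, failed_literal, actions):
    """Replace the failing condition node with a subtree that can achieve it."""
    kind, payload = tree
    if kind == "Condition" and payload == failed_literal:
        achievers = [a for a in actions if failed_literal in a.effects]
        branches = [sequence(*[condition(p) for p in a.preconditions], act(a))
                    for a in achievers]
        return fallback(tree, *branches)
    if kind in ("Fallback", "Sequence"):
        return (kind, [expand(child, failed_literal, actions) for child in payload])
    return tree

actions = [Action("open_door", ["door_unlocked"], ["door_open"]),
           Action("unlock_door", ["has_key"], ["door_unlocked"])]
bt = condition("door_open")
bt = expand(bt, "door_open", actions)      # first expansion toward the goal
bt = expand(bt, "door_unlocked", actions)  # expand the new unmet condition
print(bt)
```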

#3 I3DOL: Incremental 3D Object Learning without Catastrophic Forgetting

Authors: Jiahua Dong ; Yang Cong ; Gan Sun ; Bingtao Ma ; Lichen Wang

3D object classification has attracted considerable attention in academic research and industrial applications. However, most existing methods need to access the training data of past 3D object classes when facing the common real-world scenario in which new classes of 3D objects arrive in a sequence. Moreover, the performance of advanced approaches degrades dramatically on previously learned classes (i.e., catastrophic forgetting), due to the irregular and redundant geometric structures of 3D point cloud data. To address these challenges, we propose a new Incremental 3D Object Learning (I3DOL) model, which is the first exploration of continually learning new classes of 3D objects. Specifically, an adaptive-geometric centroid module is designed to construct discriminative local geometric structures, which can better characterize the irregular point cloud representation of 3D objects. Afterwards, to prevent the catastrophic forgetting brought by redundant geometric information, a geometric-aware attention mechanism is developed to quantify the contributions of local geometric structures and explore unique 3D geometric characteristics with high contributions for class-incremental learning. Meanwhile, a score fairness compensation strategy is proposed to further alleviate the catastrophic forgetting caused by unbalanced data between past and new classes of 3D objects, by compensating the biased prediction for new classes in the validation phase. Experiments on representative 3D datasets validate the superiority of our I3DOL framework.
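
The attention step described here can be pictured as a weighted pooling of local geometric features: each local structure receives a learned contribution score, and the object embedding is the attention-weighted sum. The feature sizes and the scoring network in the sketch below are illustrative assumptions, not I3DOL's exact module.

```python
# Minimal sketch of attention over local geometric structures: each local
# centroid feature gets a learned contribution score, and the object embedding
# is the attention-weighted sum of those features. Feature sizes and the scoring
# network are illustrative assumptions, not I3DOL's exact module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricAttention(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # contribution score per local structure

    def forward(self, local_features):
        """local_features: (num_centroids, feat_dim) per-centroid descriptors."""
        weights = F.softmax(self.score(local_features), dim=0)   # (num_centroids, 1)
        return (weights * local_features).sum(dim=0)             # (feat_dim,) embedding

pooled = GeometricAttention()(torch.randn(32, 64))
print(pooled.shape)   # torch.Size([64])
```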

#4 Enabling Fast Instruction-Based Modification of Learned Robot Skills

Authors: Tyler Frasca ; Bradley Oosterveld ; Meia Chita-Tegmark ; Matthias Scheutz

Much research effort in HRI has focused on how to enable robots to learn new skills from observations, demonstrations, and instructions. Less work, however, has focused on how skills can be corrected if they were learned incorrectly, adapted to changing circumstances, or generalized/specialized to different contexts. In this paper, a skill modification framework is introduced that allows users to quickly modify a robot’s stored skills through instructions to (1) reduce inefficiencies, (2) fix errors, and (3) enable generalizations, in a way that makes modified skills immediately available for task performance. A thorough evaluation of the implemented framework shows the operation of the algorithms, integrated in a cognitive robotic architecture, on different fully autonomous robots in various HRI case studies. An additional online HRI user study verifies that subjects prefer to quickly modify robot knowledge in the way proposed by the framework.

#5 Consistent Right-Invariant Fixed-Lag Smoother with Application to Visual Inertial SLAM

Authors: Jianzhu Huai ; Yukai Lin ; Yuan Zhuang ; Min Shi

State estimation problems without absolute position measurements routinely arise in the navigation of unmanned aerial vehicles, autonomous ground vehicles, etc., whose proper operation relies on accurate state estimates and reliable covariances. Lacking absolute position information, these problems have inherent unobservable directions. Traditional causal estimators, however, usually gain spurious information along the unobservable directions, leading to over-confident covariances inconsistent with the actual estimator errors. The consistency problem of fixed-lag smoothers (FLSs) has so far only been attacked with the first-estimate Jacobian (FEJ) technique, because of the complexity of analyzing their observability properties, but the FEJ has several drawbacks hampering its wide adoption. To ensure the consistency of an FLS, this paper introduces the right-invariant error formulation into the FLS framework. To our knowledge, we are the first to analyze the observability of an FLS with the right-invariant error. Our main contributions are twofold. First, to bypass the complexity of analysis with the classic observability matrix, we show that observability analysis of FLSs can be done equivalently on the linearized system. Second, we prove that the inconsistency issue of the traditional FLS can be elegantly solved by the right-invariant error formulation without artificially correcting Jacobians. By applying the proposed FLS to the monocular visual-inertial simultaneous localization and mapping (SLAM) problem, we confirm in simulation that the method estimates covariances consistently, similar to a batch smoother, and that it achieves accuracy comparable to traditional FLSs on real data.
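
For readers unfamiliar with the invariant-error formulation, the right-invariant error referenced above is defined as follows in standard invariant-filtering notation, assuming the state is an element of a matrix Lie group such as SE(3); this is the textbook form, shown for illustration rather than reproduced from the paper.

```latex
% Right-invariant error between the estimate \hat{X} and the true state X, both
% elements of a matrix Lie group G (e.g., SE(3)); standard invariant-filtering
% notation, used here only for illustration.
\eta^{r} = \hat{X}\, X^{-1} \in G, \qquad
\eta^{r} = \exp\!\left(\xi^{\wedge}\right) \approx I + \xi^{\wedge}
\quad \text{for small } \xi \in \mathbb{R}^{d}.
```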

#6 Supervised Training of Dense Object Nets using Optimal Descriptors for Industrial Robotic Applications

Authors: Andras Gabor Kupcsik ; Markus Spies ; Alexander Klein ; Marco Todescato ; Nicolai Waniek ; Philipp Schillinger ; Mathias Bürger

Dense Object Nets (DONs) by Florence, Manuelli and Tedrake (2018) introduced dense object descriptors as a novel visual object representation for the robotics community. They are suitable for many applications, including object grasping and policy learning. DONs map an RGB image depicting an object into a descriptor space image, which implicitly encodes key features of the object invariant to the relative camera pose. Impressively, the self-supervised training of DONs can be applied to arbitrary objects and can be evaluated and deployed within hours. However, the training approach relies on accurate depth images and faces challenges with the small, reflective objects typical of industrial settings when using consumer-grade depth cameras. In this paper we show that, given a 3D model of an object, we can generate its descriptor space image, which allows for supervised training of DONs. We rely on Laplacian Eigenmaps (LE) to embed the 3D model of an object into an optimally generated space. While our approach uses more domain knowledge, it can be efficiently applied even to small and reflective objects, as it does not rely on depth information. We compare the training methods on generating 6D grasps for industrial objects and show that our novel supervised training approach improves pick-and-place performance in industry-relevant tasks.
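
The Laplacian Eigenmaps step can be sketched roughly as follows: build a neighborhood graph over the model's vertices and take the first non-trivial eigenvectors of its graph Laplacian as per-vertex target descriptors for supervised training. The k-NN graph construction, the neighborhood size, and the descriptor dimension below are illustrative choices, not the parameters used in the paper.

```python
# Minimal sketch: Laplacian Eigenmap embedding of a point cloud's vertices via a
# k-NN graph, giving per-vertex target descriptors for supervised training.
# Graph construction, k, and the descriptor dimension are illustrative choices.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh

def laplacian_eigenmap_descriptors(vertices, dim=3, k=8):
    """Embed mesh/point-cloud vertices into a `dim`-D descriptor space using the
    first non-trivial eigenvectors of the graph Laplacian of a k-NN graph."""
    adjacency = kneighbors_graph(vertices, n_neighbors=k, mode="connectivity")
    adjacency = 0.5 * (adjacency + adjacency.T)          # symmetrize
    laplacian = csgraph.laplacian(adjacency, normed=True)
    # Smallest eigenpairs; the first eigenvector is (near-)constant, so skip it.
    vals, vecs = eigsh(laplacian, k=dim + 1, which="SM")
    return vecs[:, 1:dim + 1]

vertices = np.random.rand(500, 3)           # stand-in for a 3D model's vertices
descriptors = laplacian_eigenmap_descriptors(vertices)
print(descriptors.shape)                    # (500, 3) per-vertex descriptors
```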

#7 DenserNet: Weakly Supervised Visual Localization Using Multi-Scale Feature Aggregation

Authors: Dongfang Liu ; Yiming Cui ; Liqi Yan ; Christos Mousas ; Baijian Yang ; Yingjie Chen

In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture which aggregates feature maps at different semantic levels for image representations. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs. We use a weakly supervised triplet ranking loss to learn discriminative features and encourage keypoint feature repeatability for image representation. Finally, our method is computationally efficient, as our architecture shares features and parameters during forward propagation. Our method is flexible and can be built on a lightweight backbone architecture to achieve appealing efficiency with a small penalty on accuracy. Extensive experimental results indicate that our method sets a new state of the art on four challenging large-scale localization benchmarks and three image retrieval benchmarks with the same level of supervision. The code is available at https://github.com/goodproj13/DenserNet
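
The weakly supervised triplet ranking loss mentioned above follows a familiar pattern: a query descriptor should be closer to a GPS-tagged positive than to any negative by at least a margin. The sketch below shows that generic form; the descriptor dimension, margin value, and tensor shapes are assumptions, not DenserNet's exact training configuration.

```python
# Minimal sketch of a weakly supervised triplet ranking loss over global image
# descriptors: a query should be closer to a GPS-tagged positive than to any
# negative by at least a margin. Shapes and the margin value are assumptions.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(query, positive, negatives, margin=0.1):
    """query, positive: (D,) descriptors; negatives: (N, D) descriptors."""
    d_pos = torch.norm(query - positive)                       # distance to positive
    d_neg = torch.norm(query.unsqueeze(0) - negatives, dim=1)  # distances to negatives
    # Hinge on every negative: pull the positive closer than each negative.
    return F.relu(d_pos - d_neg + margin).sum()

query = torch.randn(256)
positive = query + 0.05 * torch.randn(256)
negatives = torch.randn(10, 256)
print(triplet_ranking_loss(query, positive, negatives))
```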

#8 Learning Intuitive Physics with Multimodal Generative Models

Authors: Sahand Rezaei-Shoshtari ; Francois R. Hogan ; Michael Jenkin ; David Meger ; Gregory Dudek

Predicting the future interaction of objects when they come into contact with their environment is key for autonomous agents to take intelligent and anticipatory actions. This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes. Visual information captures object properties such as 3D shape and location, while tactile information provides critical cues about interaction forces and resulting object motion when it makes contact with the environment. Utilizing a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces, our system captures both the visual appearance and the tactile properties of objects. We interpret the dual stream signals from the sensor using a Multimodal Variational Autoencoder (MVAE), allowing us to capture both modalities of contacting objects and to develop a mapping from visual to tactile interaction and vice-versa. Additionally, the perceptual system can be used to infer the outcome of future physical interactions, which we validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
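
The visual-to-tactile mapping described above can be sketched with a toy multimodal VAE: each modality has an encoder into a shared Gaussian latent and a decoder back out, so a tactile signal can be decoded from a visually inferred latent (and vice versa). Layer sizes and the single-layer encoders/decoders below are placeholder assumptions, not the paper's MVAE architecture.

```python
# Minimal sketch of the cross-modal use of a multimodal VAE: encode the visual
# observation into a shared Gaussian latent and decode the tactile signal from
# that latent. The reparameterization is standard VAE machinery; all layer sizes
# are assumptions rather than the paper's model.
import torch
import torch.nn as nn

class CrossModalVAE(nn.Module):
    def __init__(self, visual_dim=128, tactile_dim=64, latent_dim=16):
        super().__init__()
        self.visual_enc = nn.Linear(visual_dim, 2 * latent_dim)    # -> (mu, logvar)
        self.tactile_enc = nn.Linear(tactile_dim, 2 * latent_dim)
        self.visual_dec = nn.Linear(latent_dim, visual_dim)
        self.tactile_dec = nn.Linear(latent_dim, tactile_dim)

    def encode(self, x, encoder):
        mu, logvar = encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterize
        return z

    def visual_to_tactile(self, visual):
        """Predict the tactile signal implied by a visual observation."""
        return self.tactile_dec(self.encode(visual, self.visual_enc))

model = CrossModalVAE()
predicted_tactile = model.visual_to_tactile(torch.randn(1, 128))
print(predicted_tactile.shape)   # torch.Size([1, 64])
```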

#9 SCAN: A Spatial Context Attentive Network for Joint Multi-Agent Intent Prediction

Authors: Jasmine Sekhon ; Cody Fleming

Safe navigation of autonomous agents in human-centric environments requires the ability to understand and predict the motion of neighboring pedestrians. However, predicting pedestrian intent is a complex problem: pedestrian motion is governed by complex social navigation norms, is dependent on neighbors' trajectories, and is multimodal in nature. In this work, we propose SCAN, a Spatial Context Attentive Network that can jointly predict socially acceptable multiple future trajectories for all pedestrians in a scene. SCAN encodes the influence of spatially close neighbors using a novel spatial attention mechanism in a manner that relies on fewer assumptions, is parameter efficient, and is more interpretable compared to state-of-the-art spatial attention approaches. Through experiments on several datasets we demonstrate that our approach can also quantitatively outperform state-of-the-art trajectory prediction methods in terms of the accuracy of predicted intent.
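
The general shape of a spatial attention mechanism over neighbors can be sketched as follows: each neighbor's hidden state is weighted by a softmax score that depends on its displacement from the pedestrian of interest. This is a generic distance-aware attention pooling, shown only to illustrate the idea, not SCAN's parameterization.

```python
# Minimal sketch of spatial attention over neighbors: each neighbor's hidden
# state is weighted by a softmax score based on its relative displacement from
# the pedestrian of interest. Generic illustration, not SCAN's exact mechanism.
import torch
import torch.nn.functional as F

def spatial_attention(ego_pos, neighbor_pos, neighbor_hidden):
    """ego_pos: (2,), neighbor_pos: (N, 2), neighbor_hidden: (N, H)."""
    rel = neighbor_pos - ego_pos                      # relative displacement
    dist = torch.norm(rel, dim=1)                     # (N,)
    weights = F.softmax(-dist, dim=0)                 # closer neighbors weigh more
    return (weights.unsqueeze(1) * neighbor_hidden).sum(dim=0)   # (H,) context

context = spatial_attention(torch.zeros(2), torch.randn(5, 2), torch.randn(5, 32))
print(context.shape)   # torch.Size([32])
```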

#10 IDOL: Inertial Deep Orientation-Estimation and Localization

Authors: Scott Sun ; Dennis Melamed ; Kris Kitani

Many smartphone applications use inertial measurement units (IMUs) to sense movement, but the use of these sensors for pedestrian localization can be challenging due to their noise characteristics. Recent data-driven inertial odometry approaches have demonstrated the increasing feasibility of inertial navigation. However, they still rely upon conventional smartphone orientation estimates that they assume to be accurate, while in fact these orientation estimates can be a significant source of error. To address the problem of inaccurate orientation estimates, we present a two-stage, data-driven pipeline using a commodity smartphone that first estimates device orientations and then estimates device position. The orientation module relies on a recurrent neural network and Extended Kalman Filter to obtain orientation estimates that are used to then rotate raw IMU measurements into the appropriate reference frame. The position module then passes those measurements through another recurrent network architecture to perform localization. Our proposed method outperforms state-of-the-art methods in both orientation and position error on a large dataset we constructed that contains 20 hours of pedestrian motion across 3 buildings and 15 subjects. Code and data are available at https://github.com/KlabCMU/IDOL.
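
The hand-off between the two stages can be illustrated with a small sketch: per-sample orientation estimates rotate the raw IMU readings from the device frame into a common reference frame before the position network consumes them. The quaternion convention and array shapes below are assumptions for illustration, not the pipeline's exact interface.

```python
# Minimal sketch of the hand-off between the two stages: per-sample orientation
# estimates rotate raw IMU readings from the device frame into a common reference
# frame before the position network consumes them. Quaternion convention and
# array shapes are assumptions for illustration.
import numpy as np
from scipy.spatial.transform import Rotation

def rotate_imu_to_world(accel_device, gyro_device, quats_xyzw):
    """accel/gyro: (T, 3) device-frame samples; quats_xyzw: (T, 4) estimated
    device-to-world orientations in scipy's (x, y, z, w) convention."""
    R_world_from_device = Rotation.from_quat(quats_xyzw)
    accel_world = R_world_from_device.apply(accel_device)
    gyro_world = R_world_from_device.apply(gyro_device)
    return np.concatenate([accel_world, gyro_world], axis=1)   # (T, 6) input features

T = 200
features = rotate_imu_to_world(np.random.randn(T, 3), np.random.randn(T, 3),
                               np.tile([0.0, 0.0, 0.0, 1.0], (T, 1)))
print(features.shape)   # (200, 6)
```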

#11 Differentiable Fluids with Solid Coupling for Learning and Control

Authors: Tetsuya Takahashi ; Junbang Liang ; Yi-Ling Qiao ; Ming C. Lin

We introduce an efficient differentiable fluid simulator that can be integrated with deep neural networks as a part of layers for learning dynamics and solving control problems. It offers the capability to handle one-way coupling of fluids with rigid objects using a variational principle that naturally enforces necessary boundary conditions at the fluid-solid interface with sub-grid details. This simulator utilizes the adjoint method to efficiently compute the gradient for multiple time steps of fluid simulation with user defined objective functions. We demonstrate the effectiveness of our method for solving inverse and control problems on fluids with one-way coupled solids. Our method outperforms the previous gradient computations, state-of-the-art derivative-free optimization, and model-free reinforcement learning techniques by at least one order of magnitude.
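
The control-by-gradient idea can be illustrated with a toy sketch: roll a differentiable dynamics step forward for several timesteps and backpropagate a terminal objective to the control input. Here reverse-mode autodiff plays the role the adjoint method plays in the paper; the dynamics below is a deliberately trivial stand-in, not a fluid solver with solid coupling.

```python
# Minimal sketch of gradient-based control through a differentiable simulator:
# roll a (toy, non-fluid) differentiable step forward for several timesteps and
# backpropagate a terminal objective to the control input. Reverse-mode autodiff
# stands in for the adjoint method; the dynamics is a placeholder, not a fluid
# solver with solid coupling.
import torch

def step(state, control, dt=0.1):
    """Toy damped dynamics; a real simulator would advance the fluid state."""
    return state + dt * (control - 0.5 * state)

control = torch.zeros(2, requires_grad=True)
target = torch.tensor([1.0, -1.0])
optimizer = torch.optim.Adam([control], lr=0.1)

for _ in range(200):
    state = torch.zeros(2)
    for _ in range(20):                      # multi-step rollout
        state = step(state, control)
    loss = ((state - target) ** 2).sum()     # user-defined objective
    optimizer.zero_grad()
    loss.backward()                          # gradients through all timesteps
    optimizer.step()

print(control.detach())                      # control that drives the state to target
```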

#12 CMAX++ : Leveraging Experience in Planning and Execution using Inaccurate Models

Authors: Anirudh Vemula ; J. Andrew Bagnell ; Maxim Likhachev

Given access to accurate dynamical models, modern planning approaches are effective in computing feasible and optimal plans for repetitive robotic tasks. However, it is difficult to model the true dynamics of the real world before execution, especially for tasks requiring interactions with objects whose parameters are unknown. A recent planning approach, CMAX, tackles this problem by adapting the planner online during execution to bias the resulting plans away from inaccurately modeled regions. CMAX, while being provably guaranteed to reach the goal, requires strong assumptions on the accuracy of the model used for planning and fails to improve the quality of the solution over repetitions of the same task. In this paper we propose CMAX++, an approach that leverages real-world experience to improve the quality of resulting plans over successive repetitions of a robotic task. CMAX++ achieves this by integrating model-free learning using acquired experience with model-based planning using the potentially inaccurate model. We provide provable guarantees on the completeness and asymptotic convergence of CMAX++ to the optimal path cost as the number of repetitions increases. CMAX++ is also shown to outperform baselines in simulated robotic tasks including 3D mobile robot navigation where the track friction is incorrectly modeled, and a 7D pick-and-place task where the mass of the object is unknown leading to discrepancy between true and modeled dynamics.
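
The blend the abstract describes can be pictured with a small sketch: during search, the cost-to-go of a successor uses a value estimate learned from real executions when one is available, and otherwise falls back to the (possibly inaccurate) model-based heuristic. This is a simplified illustration of the idea, not CMAX++'s exact update rule or its guarantees.

```python
# Minimal sketch of blending model-free experience with model-based planning:
# the cost-to-go of a successor uses a learned value estimate when available and
# otherwise the model-based heuristic. Simplified illustration only.

def hybrid_cost_to_go(state, learned_q, model_heuristic):
    """learned_q: dict mapping visited states to model-free value estimates;
    model_heuristic: callable giving the model-based heuristic."""
    return learned_q.get(state, model_heuristic(state))

def greedy_step(state, successors, step_cost, learned_q, model_heuristic):
    """One-step lookahead: pick the successor minimizing step cost + cost-to-go."""
    return min(
        successors(state),
        key=lambda s: step_cost(state, s) + hybrid_cost_to_go(s, learned_q, model_heuristic),
    )

# Toy 1-D example: move toward goal 5; the model underestimates the cost near
# state 3, but experience (learned_q) corrects it after repeated executions.
goal = 5
successors = lambda s: [s - 1, s + 1]
step_cost = lambda s, t: 1.0
model_heuristic = lambda s: abs(goal - s)
learned_q = {3: 10.0}   # experience says state 3 is actually expensive to escape
print(greedy_step(2, successors, step_cost, learned_q, model_heuristic))  # -> 1
```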

#13 Generative Partial Visual-Tactile Fused Object Clustering

Authors: Tao Zhang ; Yang Cong ; Gan Sun ; Jiahua Dong ; Yuyang Liu ; Zhengming Ding

Visual-tactile fused sensing for object clustering has achieved significant progress recently, since the involvement of the tactile modality can effectively improve clustering performance. However, missing-data (i.e., partial data) issues frequently arise due to occlusion and noise during the data collection process. These issues are not well handled by most existing partial multi-view clustering methods because of the heterogeneous-modality challenge, and naively employing such methods would inevitably induce a negative effect and further hurt performance. To solve these challenges, we propose a Generative Partial Visual-Tactile Fused (GPVTF) framework for object clustering. More specifically, we first extract partial visual and tactile features from the partial visual and tactile data, respectively, and encode the extracted features in modality-specific feature subspaces. A conditional cross-modal clustering generative adversarial network is then developed to synthesize one modality conditioned on the other, which can compensate for missing samples and align the visual and tactile modalities naturally through adversarial learning. In the end, two pseudo-label based KL-divergence losses are employed to update the corresponding modality-specific encoders. Extensive comparative experiments on three public visual-tactile datasets prove the effectiveness of our method.
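
A pseudo-label KL-divergence clustering loss of the kind mentioned above is commonly built by sharpening soft cluster assignments into a target distribution and training the encoder to match it. The DEC-style formulation below is an illustration of that loss family, not GPVTF's exact objective.

```python
# Minimal sketch of a pseudo-label KL-divergence clustering loss: soft cluster
# assignments are sharpened into a target (pseudo-label) distribution and the
# encoder is trained to match it. DEC-style illustration, not GPVTF's objective.
import torch
import torch.nn.functional as F

def pseudo_label_kl_loss(soft_assignments):
    """soft_assignments: (N, K) cluster probabilities from one modality's encoder."""
    q = soft_assignments
    weight = q ** 2 / q.sum(dim=0, keepdim=True)               # sharpen assignments
    p = (weight / weight.sum(dim=1, keepdim=True)).detach()    # pseudo-label targets
    return F.kl_div(q.log(), p, reduction="batchmean")

q = F.softmax(torch.randn(32, 4), dim=1)
print(pseudo_label_kl_loss(q))
```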

#14 VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

Authors: Kaichen Zhou ; Changhao Chen ; Bing Wang ; Muhamad Risqi U. Saputra ; Niki Trigoni ; Andrew Markham

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and how to deal with degraded or missing input are less well studied. In particular, we note that previous approaches to deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approach to feature space fusion through summation or concatenation, which does not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational autoencoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results prove its efficacy. The source code is available at https://github.com/Zalex97/VMLoc.
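
The Product-of-Experts fusion step can be sketched with the standard formula for Gaussian experts: each modality contributes a Gaussian posterior over the shared latent, and the fused Gaussian is their precision-weighted combination. The sketch below shows that standard formula only; the latent dimension and the example expert variances are assumptions, and VMLoc's full model additionally uses attention-based fusion and an importance-weighted objective.

```python
# Minimal sketch of Product-of-Experts fusion: each modality contributes a
# Gaussian posterior over the shared latent, and the fused Gaussian is their
# precision-weighted combination. Standard PoE formula for Gaussians, shown as
# an illustration of the fusion step rather than VMLoc's full model.
import torch

def product_of_experts(mus, logvars):
    """mus, logvars: lists of (latent_dim,) per-modality Gaussian parameters."""
    precisions = [torch.exp(-lv) for lv in logvars]           # 1 / sigma^2
    fused_precision = sum(precisions)
    fused_var = 1.0 / fused_precision
    fused_mu = fused_var * sum(p * m for p, m in zip(precisions, mus))
    return fused_mu, fused_var.log()

mu_rgb, logvar_rgb = torch.zeros(16), torch.zeros(16)            # confident expert
mu_depth, logvar_depth = torch.ones(16), torch.full((16,), 2.0)  # noisier expert
fused_mu, fused_logvar = product_of_experts([mu_rgb, mu_depth],
                                            [logvar_rgb, logvar_depth])
print(fused_mu[0].item())   # pulled mostly toward the more certain (RGB) expert
```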